The 2019-2020 Coronavirus Pandemic Analysis

Contact: Smith Research

BACKGROUND & APPROACH

I wanted to track and trend the coronavirus outbreak out of my own curiosity. Some interesting questions may fall out of this, as it is a historic moment both scientifically and analytically: a large amount of data is being shared across the globe and analyzed in real time. The world has come to a halt because of it.
This analysis attempts to answer the following questions (more to come):

  1. What does the trend of the pandemic look like to date?
  2. What are future case predictions based on a historical model?
  3. What interesting quirks or patterns emerge?

ASSUMPTIONS & LIMITATIONS: * This data is limited by the source. I realized early on that the number of cases conflicted depending on the source. Originally I was using JHU data… but it was always ‘ahead’ of Our World In Data. I also noticed that JHU’s website was buggy: clicking on the U.S. stats didn’t actually reflect the U.S. So I changed data sources to be more consistent with what is presented in the media (and Our World In Data has more extensive plots I can compare my own to). An interesting aside might be: why the discrepancy? Was I missing something?
* Definitions are important, as is the idea that multiple variables accumulate into quantities like total cases (more testing, for example).

SOURCE RAW DATA: * https://ourworldindata.org/coronavirus
* https://github.com/CSSEGISandData/COVID-19/

INPUT DATA LOCATION: github (https://github.com/sbs87/coronavirus/tree/master/data)

OUTPUT DATA LOCATION: github (https://github.com/sbs87/coronavirus/tree/master/results)

TIMESTAMP

Start: ##—— Sat Jun 27 20:27:43 2020 ——##

PRE-ANALYSIS

The following sections are outside the scope of the ‘analysis’ proper but are still needed to prepare everything.

UPSTREAM PROCESSING/ANALYSIS

  1. Google Mobility Scraping, script available at get_google_mobility.py
# Mobility data has to be extracted from Google PDF reports using a web-scraping script (Python, written by Peter Simone, https://github.com/petersim1/MIT_COVID19)

# See get_google_mobility.py for local script 

python3 get_google_mobility.py
# writes csv file of mobility data as "mobility.csv"

SET UP ENVIRONMENT

Load libraries and set global variables

# timestamp start
timestamp()
## ##------ Sat Jun 27 20:27:43 2020 ------##

# clear previous environment
rm(list = ls())

##------------------------------------------
## LIBRARIES
##------------------------------------------
library(plyr)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.0     ✓ purrr   0.3.3
## ✓ tibble  3.0.0     ✓ dplyr   0.8.5
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::arrange()   masks plyr::arrange()
## x purrr::compact()   masks plyr::compact()
## x dplyr::count()     masks plyr::count()
## x dplyr::failwith()  masks plyr::failwith()
## x dplyr::filter()    masks stats::filter()
## x dplyr::id()        masks plyr::id()
## x dplyr::lag()       masks stats::lag()
## x dplyr::mutate()    masks plyr::mutate()
## x dplyr::rename()    masks plyr::rename()
## x dplyr::summarise() masks plyr::summarise()
## x dplyr::summarize() masks plyr::summarize()
library(ggplot2)
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(plot.utils)
library(utils)
library(knitr)

##------------------------------------------

##------------------------------------------
# GLOBAL VARIABLES
##------------------------------------------
user_name <- Sys.info()["user"]
working_dir <- paste0("/Users/", user_name, "/Projects/coronavirus/")  # don't forget trailing /
results_dir <- paste0(working_dir, "results/")  # assumes directory exists
results_dir_custom <- paste0(results_dir, "custom/")  # assumes directory exists


Corona_Cases.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"
Corona_Cases.US.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv"
Corona_Deaths.US.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_US.csv"
Corona_Deaths.source_url <- "https://github.com/CSSEGISandData/COVID-19/raw/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv"

Corona_Cases.fn <- paste0(working_dir, "data/", basename(Corona_Cases.source_url))
Corona_Cases.US.fn <- paste0(working_dir, "data/", basename(Corona_Cases.US.source_url))
Corona_Deaths.fn <- paste0(working_dir, "data/", basename(Corona_Deaths.source_url))
Corona_Deaths.US.fn <- paste0(working_dir, "data/", basename(Corona_Deaths.US.source_url))
default_theme <- theme_bw() + theme(text = element_text(size = 14))  # fix this
##------------------------------------------

FUNCTIONS

List of functions

function_name      description
prediction_model   outputs case estimates for given log-linear model parameters (slope and intercept)
make_long          converts input data to long format (specialized cases)
name_overlaps      outputs the column-name intersection and set differences of two data frames
find_linear_index  finds the first date at which linearity occurs
##------------------------------------------
## FUNCTION: prediction_model
##------------------------------------------
## --- //// ----
# Takes days vs log10 (case) linear model parameters and a set of days since 100 cases and outputs a dataframe with total number of predicted cases for those days
## --- //// ----
prediction_model<-function(m=1,b=0,days=1){
  total_cases.log<-m*days+b       # model is linear on the log10 scale
  total_cases<-10^total_cases.log # back-transform to predicted case counts
  prediction<-data.frame(days=days,Total_confirmed_cases_perstate=total_cases)
  return(prediction)
}
##------------------------------------------
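As a standalone illustration of the log-linear idea behind prediction_model (the slope and intercept here are made-up values, not fitted ones):

```r
# If a fit gives log10(cases) = m*days + b, back-transforming predicts case counts
m <- 0.03   # hypothetical slope (log10 cases per day)
b <- 2      # hypothetical intercept (log10 cases at day 0)
days <- c(0, 10, 20)
predicted_cases <- 10^(m * days + b)
round(predicted_cases)  # 100 200 398: roughly doubling every 10 days (10^0.3 ~ 2)
```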

##------------------------------------------
## FUNCTION: make_long
##------------------------------------------
## --- //// ----
# Takes wide-format case data and converts it into long format, using date and total cases as the variable/value columns. Also enforces standardization (assumes the data structure's naming) via fixed variable.name, value.name, and id.vars defaults.
## --- //// ----
make_long<-function(data_in,variable.name = "Date",
                   value.name = "Total_confirmed_cases",
                   id.vars=c("case_type","Province.State","Country.Region","Lat","Long","City","Population")){

long_data<-melt(data_in,
                id.vars = id.vars,
                variable.name=variable.name,
                value.name=value.name)
return(long_data)

}
##------------------------------------------
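A toy call showing what make_long does with the wide JHU layout (the id.vars default is overridden here so the example stays small; the column names mimic the real date columns):

```r
library(reshape2)

toy_wide <- data.frame(Country.Region = c("A", "B"),
                       X1.22.20 = c(1, 0),
                       X1.23.20 = c(3, 2))
toy_long <- make_long(toy_wide, id.vars = "Country.Region")
# toy_long has 4 rows (one per region per date), with the date columns
# stacked into Date / Total_confirmed_cases
```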

## THIS WILL BE IN UTILS AT SOME POINT
name_overlaps<-function(df1,df2){
i<-intersect(names(df1),
names(df2))
sd1<-setdiff(names(df1),
names(df2))
sd2<-setdiff(names(df2),names(df1))
cat("intersection:\n",paste(i,"\n"))
cat("in df1 but not df2:\n",paste(sd1,"\n"))
cat("in df2 but not df1:\n",paste(sd2,"\n"))
return(list("int"=i,"sd_1_2"=sd1,"sd_2_1"=sd2))
}

##------------------------------------------
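A quick sanity check of name_overlaps on two toy data frames:

```r
df_a <- data.frame(Lat = 1, Long = 2, UID = 3)
df_b <- data.frame(Lat = 1, Long = 2, FIPS = 4)
ov <- name_overlaps(df_a, df_b)  # prints the overlaps and returns them as a list
# ov$int is c("Lat", "Long"); ov$sd_1_2 is "UID"; ov$sd_2_1 is "FIPS"
```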

##------------------------------------------
## FUNCTION: find_linear_index
##------------------------------------------
## --- //// ----
# Find date at which total case data is linear (for a given data frame) 
## --- //// ----

find_linear_index<-function(tmp,running_avg=5){
  tmp$Total_confirmed_cases_perstate.log<-log(tmp$Total_confirmed_cases_perstate,2)
  derivative<-data.frame(matrix(nrow = nrow(tmp),ncol = 4))
  names(derivative)<-c("m.time","mm.time","cumsum","date")
  
  # First derivative
  for(t in 2:nrow(tmp)){
    slope.t<- tmp[t,"Total_confirmed_cases_perstate.log"]- tmp[t-1,"Total_confirmed_cases_perstate.log"]
    derivative[t,"m.time"]<-slope.t
    derivative[t,"date"]<-as.Date(tmp[t,"Date"])
  }
  
  # Second derivative
  for(t in 2:nrow(derivative)){
    slope.t<- derivative[t,"m.time"]- derivative[t-1,"m.time"]
    derivative[t,"mm.time"]<-slope.t
  }
  
  # Count how many of the previous `running_avg` second-derivative values are within 0.2 of zero
  for(t in running_avg:nrow(derivative)){
    slope.t<- sum(abs(derivative[t:(t-(running_avg-1)),"mm.time"])<0.2,na.rm = T)
    derivative[t,"cumsum"]<-slope.t
  }
  
  # Find the date `running_avg` days before the stability point
  linear_begin<-min(derivative[!is.na(derivative$cumsum) & derivative$cumsum==running_avg,"date"])-running_avg
  
  return(linear_begin)
}

READ IN DATA

# Q: do we want to archive previous versions? Maybe an auto git mv?

##------------------------------------------
## Download and read in latest data from github
##------------------------------------------
download.file(Corona_Cases.source_url, destfile = Corona_Cases.fn)
Corona_Totals.raw <- read.csv(Corona_Cases.fn, header = T, stringsAsFactors = F)

download.file(Corona_Cases.US.source_url, destfile = Corona_Cases.US.fn)
Corona_Totals.US.raw <- read.csv(Corona_Cases.US.fn, header = T, stringsAsFactors = F)

download.file(Corona_Deaths.source_url, destfile = Corona_Deaths.fn)
Corona_Deaths.raw <- read.csv(Corona_Deaths.fn, header = T, stringsAsFactors = F)

download.file(Corona_Deaths.US.source_url, destfile = Corona_Deaths.US.fn)
Corona_Deaths.US.raw <- read.csv(Corona_Deaths.US.fn, header = T, stringsAsFactors = F)

# latest date on all data:
paste("US deaths:", names(Corona_Deaths.US.raw)[ncol(Corona_Deaths.US.raw)])
## [1] "US deaths: X6.26.20"
paste("US total:", names(Corona_Totals.US.raw)[ncol(Corona_Totals.US.raw)])
## [1] "US total: X6.26.20"
paste("World deaths:", names(Corona_Deaths.raw)[ncol(Corona_Deaths.raw)])
## [1] "World deaths: X6.26.20"
paste("World total:", names(Corona_Totals.raw)[ncol(Corona_Totals.raw)])
## [1] "World total: X6.26.20"

PROCESS DATA

  • Convert to long format
  • Fix date formatting/convert to numeric date
  • Log10 transform total # cases
##------------------------------------------
## Combine death and total data frames
##------------------------------------------
Corona_Totals.raw$case_type<-"total"
Corona_Totals.US.raw$case_type<-"total"
Corona_Deaths.raw$case_type<-"death"
Corona_Deaths.US.raw$case_type<-"death"

# For some reason, Population is listed in the US death file but not in the other datasets. When combining, all datasets will have this column, but US deaths is the only one with real values.
Corona_Totals.US.raw$Population<-"NA" 
Corona_Totals.raw$Population<-"NA"
Corona_Deaths.raw$Population<-"NA"

Corona_Cases.raw<-rbind(Corona_Totals.raw,Corona_Deaths.raw)
Corona_Cases.US.raw<-rbind(Corona_Totals.US.raw,Corona_Deaths.US.raw)
#TODO: custom utils- setdiff, intersect names... option to output in merging too
##------------------------------------------
# prepare raw datasets for eventual combining
##------------------------------------------
Corona_Cases.raw$City<-"NA" # placeholder; only the US-level data has Cities
Corona_Cases.US.raw$Country_Region<-"US_state" # To differentiate from World-level stats

Corona_Cases.US.raw<-plyr::rename(Corona_Cases.US.raw,c("Province_State"="Province.State",
                                                  "Country_Region"="Country.Region",
                                                  "Long_"="Long",
                                                  "Admin2"="City"))


##------------------------------------------
## Convert to long format
##------------------------------------------
#JHU has a gross file format: wide format, with each column a date in MM/DD/YY. So read this in as raw data but transform it to be better suited for analysis.
# Furthermore, the World and US level data are formatted differently, containing different columns, etc. Rectify this and combine the world-level stats with the U.S.-level stats.

Corona_Cases.long<-rbind(make_long(select(Corona_Cases.US.raw,-c(UID,iso2,iso3,code3,FIPS,Combined_Key))),
make_long(Corona_Cases.raw))


##------------------------------------------
## Fix date formatting, convert to numeric date
##------------------------------------------
Corona_Cases.long$Date<-gsub(Corona_Cases.long$Date,pattern = "^X",replacement = "0") # leading 0 read in as X
Corona_Cases.long$Date<-gsub(Corona_Cases.long$Date,pattern = "20$",replacement = "2020") # ends in .20 and not 2020
Corona_Cases.long$Date<-as.Date(Corona_Cases.long$Date,format = "%m.%d.%y")
Corona_Cases.long$Date.numeric<-as.numeric(Corona_Cases.long$Date)
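Tracing the two substitutions plus the date parse on a single raw column name (read.csv mangles the header 1/22/20 into X1.22.20):

```r
d <- "X1.22.20"
d <- gsub(d, pattern = "^X", replacement = "0")      # "01.22.20"
d <- gsub(d, pattern = "20$", replacement = "2020")  # "01.22.2020"
as.Date(d, format = "%m.%d.%y")                      # "2020-01-22"
```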

kable(table(select(Corona_Cases.long,c("Country.Region","case_type"))),caption = "Number of death and total case longitudinal datapoints per geographical region")
Number of death and total case longitudinal datapoints per geographical region
Every region has 157 datapoints for each of death and total, except regions broken out by province or county: Australia 1256, Canada 2198, China 5181, Denmark 471, France 1727, Netherlands 785, United Kingdom 1727, and US_state 511977.
# Decouple population and lat/long data, refactor to make it more tidy
metadata_columns<-c("Lat","Long","Population")
metadata<-unique(select(filter(Corona_Cases.long,case_type=="death"),c("Country.Region","Province.State","City",all_of(metadata_columns))))
Corona_Cases.long<-select(Corona_Cases.long,-all_of(metadata_columns))

# Some countries are not summarized at the country level; collapse all but US
Corona_Cases.long<-rbind.fill(ddply(filter(Corona_Cases.long,!Country.Region=="US_state"),c("case_type","Country.Region","Date","Date.numeric"),summarise,Total_confirmed_cases=sum(Total_confirmed_cases)),filter(Corona_Cases.long,Country.Region=="US_state"))

# Put total case and deaths side-by-side (wide)
Corona_Cases<-spread(Corona_Cases.long,key = case_type,value = Total_confirmed_cases)

#Compute mortality rate
Corona_Cases$mortality_rate<-Corona_Cases$death/Corona_Cases$total

#TMP
Corona_Cases<-plyr::rename(Corona_Cases,c("total"="Total_confirmed_cases","death"="Total_confirmed_deaths"))

##------------------------------------------
## log10 transform total # cases
##------------------------------------------
Corona_Cases$Total_confirmed_cases.log<-log(Corona_Cases$Total_confirmed_cases,10)
Corona_Cases$Total_confirmed_deaths.log<-log(Corona_Cases$Total_confirmed_deaths,10)
##------------------------------------------
       
##------------------------------------------
## Compute # of days since 100th case (per country)
##------------------------------------------

# Find the day the 100th case was recorded for each Country/Province. NOTE: non-US countries may have odd province structures. For example, France is summarized at the country level but also has 3 provinces. I've only verified that case100 works for the U.S., so the case100_date for the U.S. is summarized both for the entire country (regardless of state) and on a per-state level.
# TODO: consider city-level summary as well. This data may be sparse

Corona_Cases<-merge(Corona_Cases,ddply(filter(Corona_Cases,Total_confirmed_cases>100),c("Country.Region"),summarise,case100_date=min(Date.numeric)))
Corona_Cases$Days_since_100<-Corona_Cases$Date.numeric-Corona_Cases$case100_date
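A toy trace of the days-since-100th-case computation above (the numbers are made up):

```r
library(plyr)

toy <- data.frame(Country.Region = "X",
                  Date.numeric = 18300:18303,
                  Total_confirmed_cases = c(50, 120, 300, 700))
first100 <- ddply(subset(toy, Total_confirmed_cases > 100),
                  "Country.Region", summarise,
                  case100_date = min(Date.numeric))
toy <- merge(toy, first100)
toy$Days_since_100 <- toy$Date.numeric - toy$case100_date
toy$Days_since_100  # -1 0 1 2: negative before the 100th case is reached
```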

##------------------------------------------
## Add population and lat/long data (CURRENTLY US ONLY)
##------------------------------------------

kable(filter(metadata,(is.na(Country.Region) | is.na(Population) )) %>% select(c("Country.Region","Province.State","City")) %>% unique(),caption = "Regions for which either population or Country is NA")
Regions for which either population or Country is NA
Country.Region Province.State City
# Drop missing data 
metadata<-filter(metadata,!(is.na(Country.Region) | is.na(Population) ))
# Convert remaining pop to numeric
metadata$Population<-as.numeric(metadata$Population)
## Warning: NAs introduced by coercion
# Add metadata to cases
Corona_Cases<-merge(Corona_Cases,metadata,all.x = T)

##------------------------------------------
## Compute total and death cases relative to population 
##------------------------------------------

Corona_Cases$Total_confirmed_cases.per100<-100*Corona_Cases$Total_confirmed_cases/Corona_Cases$Population
Corona_Cases$Total_confirmed_deaths.per100<-100*Corona_Cases$Total_confirmed_deaths/Corona_Cases$Population


##------------------------------------------
## Filter df for US state-wide stats
##------------------------------------------

Corona_Cases.US_state<-filter(Corona_Cases,Country.Region=="US_state" & Total_confirmed_cases>0 )
Corona_Cases.US_state[!is.na(Corona_Cases.US_state$Province.State) & Corona_Cases.US_state$Province.State=="Pennsylvania" & Corona_Cases.US_state$City=="Delaware","City"]<-"Delaware_PA"

kable(table(select(Corona_Cases.US_state,c("Province.State"))),caption = "Number of longitudinal datapoints (total/death) per state")
Number of longitudinal datapoints (total/death) per state
Var1 Freq
Alabama 6369
Alaska 1326
Arizona 1541
Arkansas 6780
California 5871
Colorado 5749
Connecticut 916
Delaware 387
Diamond Princess 102
District of Columbia 103
Florida 6674
Georgia 15162
Grand Princess 103
Guam 103
Hawaii 527
Idaho 3116
Illinois 8998
Indiana 8769
Iowa 8480
Kansas 7255
Kentucky 10185
Louisiana 6295
Maine 1610
Maryland 2417
Massachusetts 1520
Michigan 7852
Minnesota 7522
Mississippi 7818
Missouri 9184
Montana 2820
Nebraska 5645
Nevada 1267
New Hampshire 1093
New Jersey 2301
New Mexico 2772
New York 5889
North Carolina 9336
North Dakota 3494
Northern Mariana Islands 88
Ohio 8293
Oklahoma 6503
Oregon 3195
Pennsylvania 6477
Puerto Rico 103
Rhode Island 610
South Carolina 4526
South Dakota 4416
Tennessee 9017
Texas 19888
Utah 1396
Vermont 1470
Virgin Islands 103
Virginia 11901
Washington 4009
West Virginia 4504
Wisconsin 6434
Wyoming 2010
Corona_Cases.US_state<-merge(Corona_Cases.US_state,ddply(filter(Corona_Cases.US_state,Total_confirmed_cases>100),c("Province.State"),summarise,case100_date_state=min(Date.numeric)))
Corona_Cases.US_state$Days_since_100_state<-Corona_Cases.US_state$Date.numeric-Corona_Cases.US_state$case100_date_state

ANALYSIS

Q1: What is the trend in cases and mortality across geographical regions?

Plot # of cases vs time
* For each geographical set:
* comparative longitudinal case trend (absolute & log scale)
* comparative longitudinal mortality trend
* death vs total correlation

question / dataset / x / y / color / facet / pch / dimensions
  • comparative longitudinal case trend (log) / long / time / log_cases / geography / none (case type?) / case_type / [15, 50, 4] geography x (2 scale?) case type
  • comparative longitudinal case trend / long / time / cases / geography / case_type / ? / [15, 50, 4] geography x (2+ scale) case type
  • comparative longitudinal mortality trend / wide / time / mortality rate / geography / none / none / [15, 50, 4] geography
  • death vs total correlation / wide / cases / deaths / geography / none / none / [15, 50, 4] geography
# total cases vs time
# death cases vs time
# mortality rate vs time
# death vs mortality


  # death vs mortality
  # total & death case vs time (same plot)

#<question> <x> <y> <colored> <facet> <dataset>
## trend in cases/deaths over time, compared across regions <time> <log cases> <geography*> <none> <.wide>
## trend in cases/deaths over time, compared across regions <time> <cases> <geography*> <case_type> <.long>
## trend in mortality rate over time, compared across regions <time> <mortality rate> <geography*> <none>
## how are deaths/mortality related/correlated? <time> <log cases> <geography*> <none>
## how are deaths and case load correlated? <cases> <deaths>

# lm for each?? - > apply lm from each region starting from 100th case. m, b associated with each.
    # input: geographical regsion, logcase vs day (100th case)
    # output: m, b for each geographical region ID



#total/death on the same plot differ by ~2 logs, so when plotting log scale, use pch; when plotting absolute, use free scales
#when plotting deaths and cases on the same plot, melt.

#CoronaCases -> filter sets (3)
  #world - choose countries with sufficient data

N<-ddply(filter(Corona_Cases,Total_confirmed_cases>100),c("Country.Region"),summarise,n=length(Country.Region))
ggplot(filter(N,n<100),aes(x=n))+
  geom_histogram()+
  default_theme+
  ggtitle("Distribution of number of days with at least 100 confirmed cases for each region")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

kable(arrange(N,-n),caption="Sorted number of days with at least 100 confirmed cases")
Sorted number of days with at least 100 confirmed cases
Country.Region n
US_state 75660
China 157
Diamond Princess 138
Korea, South 128
Japan 127
Italy 125
Iran 122
Singapore 119
France 118
Germany 118
Spain 117
US 115
Switzerland 114
United Kingdom 114
Belgium 113
Netherlands 113
Norway 113
Sweden 113
Austria 111
Malaysia 110
Australia 109
Bahrain 109
Denmark 109
Canada 108
Qatar 108
Iceland 107
Brazil 106
Czechia 106
Finland 106
Greece 106
Iraq 106
Israel 106
Portugal 106
Slovenia 106
Egypt 105
Estonia 105
India 105
Ireland 105
Kuwait 105
Philippines 105
Poland 105
Romania 105
Saudi Arabia 105
Indonesia 104
Lebanon 104
Pakistan 104
San Marino 104
Thailand 104
Chile 103
Luxembourg 102
Peru 102
Russia 102
Ecuador 101
Mexico 101
Slovakia 101
South Africa 101
United Arab Emirates 101
Armenia 100
Colombia 100
Croatia 100
Panama 100
Serbia 100
Taiwan* 100
Turkey 100
Argentina 99
Bulgaria 99
Latvia 99
Uruguay 99
Algeria 98
Costa Rica 98
Dominican Republic 98
Hungary 98
Andorra 97
Bosnia and Herzegovina 97
Jordan 97
Lithuania 97
Morocco 97
New Zealand 97
North Macedonia 97
Vietnam 97
Albania 96
Cyprus 96
Malta 96
Moldova 96
Brunei 95
Burkina Faso 95
Sri Lanka 95
Tunisia 95
Ukraine 94
Azerbaijan 93
Ghana 93
Kazakhstan 93
Oman 93
Senegal 93
Venezuela 93
Afghanistan 92
Cote d’Ivoire 92
Cuba 91
Mauritius 91
Uzbekistan 91
Cambodia 90
Cameroon 90
Honduras 90
Nigeria 90
West Bank and Gaza 90
Belarus 89
Georgia 89
Bolivia 88
Kosovo 88
Kyrgyzstan 88
Montenegro 88
Congo (Kinshasa) 87
Kenya 86
Niger 85
Guinea 84
Rwanda 84
Trinidad and Tobago 84
Paraguay 83
Bangladesh 82
Djibouti 80
El Salvador 79
Guatemala 78
Madagascar 77
Mali 76
Congo (Brazzaville) 73
Jamaica 73
Gabon 71
Somalia 71
Tanzania 71
Ethiopia 70
Burma 69
Sudan 68
Liberia 67
Maldives 65
Equatorial Guinea 64
Cabo Verde 62
Sierra Leone 60
Guinea-Bissau 59
Togo 59
Zambia 58
Eswatini 57
Chad 56
Tajikistan 55
Haiti 53
Sao Tome and Principe 53
Benin 51
Nepal 51
Uganda 51
Central African Republic 50
South Sudan 50
Guyana 48
Mozambique 47
Yemen 43
Mongolia 42
Mauritania 39
Nicaragua 39
Malawi 33
Syria 33
Zimbabwe 31
Bahamas 30
Libya 30
Comoros 28
Suriname 20
Angola 17
Eritrea 12
Burundi 11
Monaco 5
Namibia 2
# Pick the top countries with data (capped at 12 by the color palette)
max_colors<-12
# find a way to fix this: China has different provinces, so the plot doesn't look right...
sufficient_data<-arrange(filter(N,!Country.Region %in% c("US_state", "Diamond Princess")),-n)[1:max_colors,]
kable(sufficient_data,caption = paste0("Top ",max_colors," countries with sufficient data"))
Top 12 countries with sufficient data
Country.Region n
China 157
Korea, South 128
Japan 127
Italy 125
Iran 122
Singapore 119
France 118
Germany 118
Spain 117
US 115
Switzerland 114
United Kingdom 114
Corona_Cases.world<-filter(Corona_Cases,Country.Region %in% c(sufficient_data$Country.Region))


  #us 
  #    - by state
Corona_Cases.US<-filter(Corona_Cases,Country.Region=="US" & Total_confirmed_cases>0)
# summarize 
#!City %in% c("Unassigned") 
  #    - specific cities
#mortality_rate!=Inf & mortality_rate<=1
head(Corona_Cases)
##   Country.Region Province.State City       Date Date.numeric
## 1    Afghanistan           <NA> <NA> 2020-03-22        18343
## 2    Afghanistan           <NA> <NA> 2020-02-11        18303
## 3    Afghanistan           <NA> <NA> 2020-06-07        18420
## 4    Afghanistan           <NA> <NA> 2020-02-28        18320
## 5    Afghanistan           <NA> <NA> 2020-03-13        18334
## 6    Afghanistan           <NA> <NA> 2020-03-14        18335
##   Total_confirmed_deaths Total_confirmed_cases mortality_rate
## 1                      1                    40      0.0250000
## 2                      0                     0            NaN
## 3                    357                 20342      0.0175499
## 4                      0                     1      0.0000000
## 5                      0                     7      0.0000000
## 6                      0                    11      0.0000000
##   Total_confirmed_cases.log Total_confirmed_deaths.log case100_date
## 1                  1.602060                   0.000000        18348
## 2                      -Inf                       -Inf        18348
## 3                  4.308394                   2.552668        18348
## 4                  0.000000                       -Inf        18348
## 5                  0.845098                       -Inf        18348
## 6                  1.041393                       -Inf        18348
##   Days_since_100 Lat Long Population Total_confirmed_cases.per100
## 1             -5  NA   NA         NA                           NA
## 2            -45  NA   NA         NA                           NA
## 3             72  NA   NA         NA                           NA
## 4            -28  NA   NA         NA                           NA
## 5            -14  NA   NA         NA                           NA
## 6            -13  NA   NA         NA                           NA
##   Total_confirmed_deaths.per100
## 1                            NA
## 2                            NA
## 3                            NA
## 4                            NA
## 5                            NA
## 6                            NA
Corona_Cases[!is.na(Corona_Cases$Province.State) & Corona_Cases$Province.State=="Pennsylvania" & Corona_Cases$City=="Delaware","City"]<-"Delaware_PA"


Corona_Cases.UScity<-filter(Corona_Cases,Province.State %in% c("Pennsylvania","Maryland","New York","New Jersey") & City %in% c("Bucks","Baltimore City", "New York","Burlington","Cape May","Delaware_PA"))


measure_vars_long<-c("Total_confirmed_cases.log","Total_confirmed_cases","Total_confirmed_deaths","Total_confirmed_deaths.log")
melt_arg_list<-list(variable.name = "case_type",value.name = "cases",measure.vars = c("Total_confirmed_cases","Total_confirmed_deaths"))
melt_arg_list$data=NULL


melt_arg_list$data=select(Corona_Cases.world,-ends_with(match = "log"))
Corona_Cases.world.long<-do.call(melt,melt_arg_list)
melt_arg_list$data=select(Corona_Cases.UScity,-ends_with(match = "log"))
Corona_Cases.UScity.long<-do.call(melt,melt_arg_list)
melt_arg_list$data=select(Corona_Cases.US_state,-ends_with(match = "log"))
Corona_Cases.US_state.long<-do.call(melt,melt_arg_list)

Corona_Cases.world.long$cases.log<-log(Corona_Cases.world.long$cases,10)
Corona_Cases.US_state.long$cases.log<-log(Corona_Cases.US_state.long$cases,10)
Corona_Cases.UScity.long$cases.log<-log(Corona_Cases.UScity.long$cases,10)


# what is the current death and total case load for US? For world? For states?
#-absolute
#-log

# what is mortality rate (US, world)
#-absolute

#how is death and case correlated? (US, world)
#-absolute
#Corona_Cases.US<-filter(Corona_Cases,Country.Region=="US" & Total_confirmed_cases>0)
#Corona_Cases.US.case100<-filter(Corona_Cases.US, Days_since_100>=0)
# linear model parameters
#(model_fit<-lm(formula = Total_confirmed_cases.log~Days_since_100,data= Corona_Cases.US.case100 ))

#(slope<-model_fit$coefficients[2])
#(intercept<-model_fit$coefficients[1])

# Correlation coefficient
#cor(x = Corona_Cases.US.case100$Days_since_100,y = Corona_Cases.US.case100$Total_confirmed_cases.log)

##------------------------------------------
## Plot World Data
##------------------------------------------
# Timestamp for world
timestamp_plot.world<-paste("Most recent date for which data available:",max(Corona_Cases.world$Date))#timestamp(quiet = T,prefix = "Updated ",suffix = " (EST)")


# Base template for plots
baseplot.world<-ggplot(data=NULL,aes(x=Days_since_100,col=Country.Region))+
  default_theme+
  scale_color_brewer(type = "qualitative",palette = "Paired")+
  ggtitle(paste("Log10 cases over time,",timestamp_plot.world))+
  theme(legend.position = "bottom",plot.title = element_text(size=12))


##/////////////////////////
### Plot Longitudinal cases

(Corona_Cases.world.long.plot<-baseplot.world+
    geom_point(data=Corona_Cases.world.long,aes(y=cases))+
    geom_line(data=Corona_Cases.world.long,aes(y=cases))+
    facet_wrap(~case_type,scales = "free_y",ncol=1)+
    ggtitle(timestamp_plot.world)
    )

(Corona_Cases.world.loglong.plot<-baseplot.world+
    geom_point(data=Corona_Cases.world.long,aes(y=cases.log))+
    geom_line(data=Corona_Cases.world.long,aes(y=cases.log))+
    facet_wrap(~case_type,scales = "free_y",ncol=1)+
    ggtitle(timestamp_plot.world))

##/////////////////////////
### Plot Longitudinal mortality rate

(Corona_Cases.world.mortality.plot<-baseplot.world+
    geom_point(data=Corona_Cases.world,aes(y=mortality_rate))+
    geom_line(data=Corona_Cases.world,aes(y=mortality_rate))+
    ylim(c(0,0.3))+
    ggtitle(timestamp_plot.world))
## Warning: Removed 100 rows containing missing values (geom_point).
## Warning: Removed 100 row(s) containing missing values (geom_path).

##/////////////////////////
### Plot death vs total case correlation

(Corona_Cases.world.casecor.plot<-ggplot(Corona_Cases.world,aes(x=Total_confirmed_cases,y=Total_confirmed_deaths,col=Country.Region))+
  geom_point()+
  geom_line()+
  default_theme+
  scale_color_brewer(type = "qualitative",palette = "Paired")+
  theme(legend.position = "bottom",plot.title = element_text(size=12))+
    ggtitle(timestamp_plot.world))

### Write plots

write_plot(Corona_Cases.world.long.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.long.plot.png"
write_plot(Corona_Cases.world.loglong.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.loglong.plot.png"
write_plot(Corona_Cases.world.mortality.plot,wd = results_dir)
## Warning: Removed 100 rows containing missing values (geom_point).

## Warning: Removed 100 row(s) containing missing values (geom_path).
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.mortality.plot.png"
write_plot(Corona_Cases.world.casecor.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/Corona_Cases.world.casecor.plot.png"
##------------------------------------------
## Plot US State Data
##-----------------------------------------

baseplot.US<-ggplot(data=NULL,aes(x=Days_since_100_state,col=case_type))+
  default_theme+
  facet_wrap(~Province.State)+
  ggtitle(paste("Log10 cases over time,",timestamp_plot.world))

Corona_Cases.US_state.long.plot<-baseplot.US+geom_point(data=Corona_Cases.US_state.long,aes(y=cases.log))
##------------------------------------------
## Plot US City Data
##-----------------------------------------

Corona_Cases.US.plotdata<-filter(Corona_Cases.US_state,Province.State %in% c("Pennsylvania","Maryland","New York","New Jersey") &
                                   City %in% c("Bucks","Baltimore City", "New York","Burlington","Cape May","Delaware_PA") &
                                   Total_confirmed_cases>0) 
timestamp_plot<-paste("Most recent date for which data available:",max(Corona_Cases.US.plotdata$Date))#timestamp(quiet = T,prefix = "Updated ",suffix = " (EST)")

city_colors<-c("Bucks"='#beaed4',"Baltimore City"='#386cb0', "New York"='#7fc97f',"Burlington"='#fdc086',"Cape May"="#e78ac3","Delaware_PA"="#b15928")

##/////////////////////////
### Plot death vs total case correlation

(Corona_Cases.city.loglong.plot<-ggplot(melt(Corona_Cases.US.plotdata,measure.vars = c("Total_confirmed_cases.log","Total_confirmed_deaths.log"),variable.name = "case_type",value.name = "cases"),aes(x=Date,y=cases,col=City,pch=case_type))+
  geom_point(size=4)+
    geom_line()+
  default_theme+
  #facet_wrap(~case_type)+
    ggtitle(paste("Log10 total and death cases over time,",timestamp_plot))+
theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
    scale_color_manual(values = city_colors)+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

(Corona_Cases.city.long.plot<-ggplot(filter(Corona_Cases.US.plotdata,Province.State !="New York"),aes(x=Date,y=Total_confirmed_cases,col=City))+
  geom_point(size=4)+
  geom_line()+
  default_theme+
  facet_grid(~Province.State,scales = "free_y")+
  ggtitle(paste("MD, PA, NJ total cases over time,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
  scale_color_manual(values = city_colors)+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

(Corona_Cases.city.mortality.plot<-ggplot(Corona_Cases.US.plotdata,aes(x=Date,y=mortality_rate,col=City))+
  geom_point(size=3)+
  geom_line(size=2)+
  default_theme+
  ggtitle(paste("Mortality rate (deaths/total) over time,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
  scale_color_manual(values = city_colors)+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

(Corona_Cases.city.casecor.plot<-ggplot(filter(Corona_Cases.US.plotdata,Province.State !="New York"),aes(y=Total_confirmed_deaths,x=Total_confirmed_cases,col=City))+
  geom_point(size=3)+
  geom_line(size=2)+
  default_theme+
  ggtitle(paste("Correlation of death vs total cases,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12))+
  scale_color_manual(values = city_colors))

(Corona_Cases.city.long.normalized.plot<-ggplot(filter(Corona_Cases.US.plotdata,Province.State !="New York"),aes(x=Date,y=Total_confirmed_cases.per100,col=City))+
  geom_point(size=4)+
  geom_line()+
  default_theme+
  facet_grid(~Province.State)+
  ggtitle(paste("MD, PA, NJ total cases over time per 100 people,",timestamp_plot))+
  theme(legend.position = "bottom",plot.title = element_text(size=12),axis.text.x = element_text(angle=45,hjust=1))+
  scale_color_manual(values = city_colors)  +
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

write_plot(Corona_Cases.city.long.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.long.plot.png"
write_plot(Corona_Cases.city.loglong.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.loglong.plot.png"
write_plot(Corona_Cases.city.mortality.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.mortality.plot.png"
write_plot(Corona_Cases.city.casecor.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.casecor.plot.png"
write_plot(Corona_Cases.city.long.normalized.plot,wd = results_dir_custom)
## [1] "/Users/stevensmith/Projects/coronavirus/results/custom/Corona_Cases.city.long.normalized.plot.png"

Q1b: What is the model?

Fit the cases to a linear model:
1. Find the time at which cases vs. date becomes linear in each plot
2. Fit a linear model for each city
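
The two steps can be sketched as follows (Python, synthetic series; treating "becomes linear" as the point where the discrete second derivative stays near zero is an assumption about what `find_linear_index`, defined upstream, does):

```python
import numpy as np

def find_linear_onset(y, tol=1.0):
    """Index at which the discrete 2nd derivative of y stays below tol."""
    d2 = np.abs(np.diff(y, n=2))
    for i in range(len(d2)):
        if np.all(d2[i:] < tol):
            return i
    return max(len(y) - 2, 0)

# Synthetic series: quadratic ramp-up that settles into a straight line
y = np.array([1, 4, 9, 16, 25, 36, 46, 56, 66, 76], dtype=float)
start = find_linear_onset(y, tol=0.5)

# Fit only the linear tail (analogue of the per-state lm() loop below)
x = np.arange(len(y))
m, b = np.polyfit(x[start:], y[start:], 1)
```

For this series the onset is index 5, and the tail fits the line y = 10x - 14 exactly.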

# What is the predicted # of cases for the next few days?
# How is the model performing historically?

Corona_Cases.US_state.summary<-ddply(Corona_Cases.US_state,
                                     c("Province.State","Date"),
                                     summarise,
                                     Total_confirmed_cases_perstate=sum(Total_confirmed_cases)) %>% 
    filter(Total_confirmed_cases_perstate>100)

# Compute the states with the most cases (for coloring and for linear model)
top_states_totals<-head(ddply(Corona_Cases.US_state.summary,c("Province.State"),summarise, Total_confirmed_cases_perstate.max=max(Total_confirmed_cases_perstate)) %>% arrange(-Total_confirmed_cases_perstate.max),n=max_colors)

kable(top_states_totals,caption = "Top 12 States, total count ")
top_states<-top_states_totals$Province.State

# Manually fix states so that Maryland is switched out for New York
top_states_modified<-c(top_states[top_states !="New York"],"Maryland")

# Plot with all states:
(Corona_Cases.US_state.summary.plot<-ggplot(Corona_Cases.US_state.summary,aes(x=Date,y=Total_confirmed_cases_perstate))+
  geom_point()+
  geom_point(data=filter(Corona_Cases.US_state.summary,Province.State %in% top_states),aes(col=Province.State))+
  scale_color_brewer(type = "qualitative",palette = "Paired")+
  default_theme+
  theme(axis.text.x = element_text(angle=45,hjust=1),legend.position = "bottom")+
  ggtitle("Total confirmed cases per state, top 12 colored")+
  scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

##------------------------------------------
## Fit linear model to time vs total cases
##-----------------------------------------

# First, find the date at which each state's cases vs time becomes linear (2nd derivative is about 0)
li<-ddply(Corona_Cases.US_state.summary,c("Province.State"),find_linear_index)

# Compute linear model for each state starting at the point at which data becomes linear
for(i in 1:nrow(li)){
  Province.State.i<-li[i,"Province.State"]
  date.i<-li[i,"V1"]
  data.i<-filter(Corona_Cases.US_state.summary,Province.State==Province.State.i & as.numeric(Date) >= date.i)
  model_results<-lm(data.i,formula = Total_confirmed_cases_perstate~Date)
  slope<-model_results$coefficients[2]
  intercept<-model_results$coefficients[1]
  li[li$Province.State==Province.State.i,"m"]<-slope
  li[li$Province.State==Province.State.i,"b"]<-intercept
  }

# Compute top state case load with fitted model

(Corona_Cases.US_state.lm.plot<-ggplot(filter(Corona_Cases.US_state.summary,Province.State %in% top_states_modified ))+
    geom_abline(data=filter(li,Province.State %in% top_states_modified),
                aes(slope = m,intercept = b,col=Province.State),lty=2)+
    geom_point(aes(x=Date,y=Total_confirmed_cases_perstate,col=Province.State))+
    scale_color_brewer(type = "qualitative",palette = "Paired")+
    default_theme+
    theme(axis.text.x = element_text(angle=45,hjust=1),legend.position = "bottom")+
    ggtitle("Total confirmed cases per state, top 12 colored")+
    scale_x_date(date_breaks="1 week",date_minor_breaks="1 day"))

##------------------------------------------
## Predict the number of total cases over the next week
##-----------------------------------------

predicted_days<-c(0,1,2,3,7)+as.numeric(as.Date("2020-04-20"))

predicted_days_df<-data.frame(matrix(ncol=3))
names(predicted_days_df)<-c("Province.State","days","Total_confirmed_cases_perstate")

# Use model parameters to estimate case loads
for(state.i in top_states_modified){
  predicted_days_df<-rbind(predicted_days_df,
                           data.frame(Province.State=state.i,
                                      prediction_model(m = li[li$Province.State==state.i,"m"],
                                                       b =li[li$Province.State==state.i,"b"] ,
                                                       days =predicted_days )))
  }

predicted_days_df$Date<-as.Date(predicted_days_df$days,origin="1970-01-01")

kable(predicted_days_df,caption = "Predicted total cases over the next week for selected states")
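
`prediction_model` itself is defined upstream; assuming it simply evaluates the fitted line ŷ = m·day + b at numeric dates (consistent with the `lm` slope and intercept stored in `li`), the prediction step reduces to this Python sketch with made-up coefficients:

```python
from datetime import date

def predict_cases(m, b, day_numbers):
    """Evaluate the fitted line y = m*x + b at the requested day numbers."""
    return [m * d + b for d in day_numbers]

# Hypothetical coefficients (slope in cases/day, intercept vs. 1970 epoch days);
# these are NOT the fitted values from the models above.
m, b = 950.0, -17_400_000.0

# Same day numbering as R's as.numeric(as.Date("2020-04-20")): days since 1970-01-01
base = date(2020, 4, 20).toordinal() - date(1970, 1, 1).toordinal()
days = [base + k for k in (0, 1, 2, 3, 7)]
preds = predict_cases(m, b, days)
```

Each predicted value rises by one day's slope (950 cases here) per day, which is the linear-growth assumption baked into the model.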

##------------------------------------------
## Write plots
##-----------------------------------------

write_plot(Corona_Cases.US_state.summary.plot,wd = results_dir)
write_plot(Corona_Cases.US_state.lm.plot,wd = results_dir)

##------------------------------------------
## Write tables
##-----------------------------------------

write.csv(predicted_days_df,file = paste0(results_dir,"predicted_total_cases_days.csv"),quote = F,row.names = F)

Q2: What is the predicted number of cases?

What does the model fitted so far predict for COVID-19 case counts? Additional questions:

Why did it take until day 40 for a log-linear trend to start? How long until x number of cases? When will the plateau happen? Are any effects of social distancing (including its delays) noticeable?
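
One way to make the log-linear question concrete: when log10(cases) grows linearly, the slope converts directly into a doubling time. A small Python check with a hypothetical slope of 0.1 per day (not a fitted value):

```python
import math

def doubling_time(log10_slope):
    """Days for cases to double when log10(cases) rises linearly at this slope per day."""
    return math.log10(2) / log10_slope

# Hypothetical: if log10(cases) rises by 0.1 per day, cases double every ~3 days
t = doubling_time(0.1)
```

A shallower log-linear slope after day 40, for example from social distancing, would translate directly into a longer doubling time.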

##------------------------------------------
## Prediction and Prediction Accuracy
##------------------------------------------


today_num<-max(Corona_Cases.US$Days_since_100)
predicted_days<-today_num+c(1,2,3,7)

#mods = dlply(mydf, .(x3), lm, formula = y ~ x1 + x2)
#today:
Corona_Cases.US[Corona_Cases.US$Days_since_100==(today_num-1),]
Corona_Cases.US[Corona_Cases.US$Days_since_100==today_num,]
Corona_Cases.US$type<-"Historical"


#prediction_values<-prediction_model(m=slope,b=intercept,days = predicted_days)$Total_confirmed_cases

historical_model<-data.frame(date=today_num,m=slope,b=intercept)
tmp<-data.frame(state=rep(c("A","B"),each=3),x=c(1,2,3,4,5,6))
tmp$y<-c(tmp[1:3,"x"]+5,tmp[4:6,"x"]*5+1)
ddply(tmp,c("state"))
lm(data =tmp,formula = y~x )

train_lm<-function(input_data,subset_column,formula_input){
  case_models <- dlply(input_data, subset_column, lm, formula = formula_input)
  case_models.parameters <- ldply(case_models, coef)
  # Columns after ldply are: subset_column, "(Intercept)", <predictor>; relabel the coefficients
  names(case_models.parameters)[2:3]<-c("b","m")
  return(case_models.parameters)
}

train_lm(tmp,"state",y~x)
 
# model for previous y days
#historical_model_predictions<-data.frame(day_x=NULL,Days_since_100=NULL,Total_confirmed_cases=NULL,Total_confirmed_cases.log=NULL)
# for(i in c(1,2,3,4,5,6,7,8,9,10)){
#   #i<-1
# day_x<-today_num-i # 1, 2, 3, 4
# day_x_nextweek<-day_x+c(1,2,3)
# model_fit_x<-lm(data = filter(Corona_Cases.US.case100,Days_since_100 < day_x),formula = Total_confirmed_cases.log~Days_since_100)
# prediction_day_x_nextweek<-prediction_model(m = model_fit_x$coefficients[2],b = model_fit_x$coefficients[1],days = day_x_nextweek)
# prediction_day_x_nextweek$type<-"Predicted"
# acutal_day_x_nextweek<-filter(Corona_Cases.US,Days_since_100 %in% day_x_nextweek) %>% select(c(Days_since_100,Total_confirmed_cases,Total_confirmed_cases.log))
# acutal_day_x_nextweek$type<-"Historical"
# historical_model_predictions.i<-data.frame(day_x=day_x,rbind(acutal_day_x_nextweek,prediction_day_x_nextweek))
# historical_model_predictions<-rbind(historical_model_predictions.i,historical_model_predictions)
# }

#historical_model_predictions.withHx<-rbind.fill(historical_model_predictions,data.frame(Corona_Cases.US,type="Historical"))
#historical_model_predictions.withHx$Total_confirmed_cases.log2<-log(historical_model_predictions.withHx$Total_confirmed_cases,2)

(historical_model_predictions.plot<-ggplot(historical_model_predictions.withHx,aes(x=Days_since_100,y=Total_confirmed_cases.log,col=type))+
    geom_point(size=3)+
    default_theme+
    theme(legend.position = "bottom")+ 
      #geom_abline(slope = slope,intercept =intercept,lty=2)+
    #facet_wrap(~case_type,ncol=1)+
    scale_color_manual(values = c("Historical"="#377eb8","Predicted"="#e41a1c")))
write_plot(historical_model_predictions.plot,wd=results_dir)

Q3: What is the effect of social distancing and decreased mobility on case load?

Load data from Google, which reports % change in user mobility relative to baseline for:

* Recreation
* Workplace
* Residence
* Park
* Grocery

Data from https://www.google.com/covid19/mobility/
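
The scraped mobility values arrive as percent strings such as "-39%" (hence the `gsub(pattern = "%", ...)` cleanup in the chunks below); the conversion amounts to this Python sketch:

```python
def parse_pct(value):
    """Convert a scraped percent string like '-39%' to a float; None when missing."""
    value = value.strip()
    if not value:
        return None
    return float(value.rstrip("%"))

raw = ["-39%", "20%", "", "-63%"]       # hypothetical scraped values
parsed = [parse_pct(v) for v in raw]
```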

# See pre-processing section for script on gathering mobility data

# UNDER DEVELOPMENT

mobility<-read.csv("/Users/stevensmith/Projects/MIT_COVID19/mobility.csv",header = T,stringsAsFactors = F)
#mobility$Retail_Recreation<-as.numeric(sub(mobility$Retail_Recreation,pattern = "%",replacement = ""))
#mobility$Workplace<-as.numeric(sub(mobility$Workplace,pattern = "%",replacement = ""))
#mobility$Residential<-as.numeric(sub(mobility$Residential,pattern = "%",replacement = ""))

##------------------------------------------
## Show relationship between mobility and caseload
##------------------------------------------
mobility$County<-gsub(mobility$County,pattern = " County",replacement = "")
Corona_Cases.US_state.mobility<-merge(Corona_Cases.US_state,plyr::rename(mobility,c("State"="Province.State","County"="City")))

#Corona_Cases.US_state.tmp<-merge(metadata,Corona_Cases.US_state.tmp)
# Needs to happen upstream, see todos
#Corona_Cases.US_state.tmp$Total_confirmed_cases.perperson<-Corona_Cases.US_state.tmp$Total_confirmed_cases/as.numeric(Corona_Cases.US_state.tmp$Population)
mobility_measures<-c("Retail_Recreation","Grocery_Pharmacy","Parks","Transit","Workplace","Residential")

plot_data<-filter(Corona_Cases.US_state.mobility, Date.numeric==max(Corona_Cases.US_state$Date.numeric) ) %>% melt(measure.vars=mobility_measures) 
plot_data$value<-as.numeric(gsub(plot_data$value,pattern = "%",replacement = ""))
plot_data<-filter(plot_data,!is.na(value))

(mobility.plot<-ggplot(filter(plot_data,Province.State %in% c("Pennsylvania","Maryland","New Jersey","California","Delaware","Connecticut")),aes(y=Total_confirmed_cases.per100,x=value))+geom_point()+
  facet_grid(Province.State~variable,scales = "free")+
  xlab("Mobility change from baseline (%)")+
  ylab(paste0("Confirmed cases per 100 people (Today)"))+
  default_theme+
  ggtitle("Mobility change vs cases"))

(mobility.global.plot<-ggplot(plot_data,aes(y=Total_confirmed_cases.per100,x=value))+geom_point()+
  facet_wrap(~variable,scales = "free")+
  xlab("Mobility change from baseline (%)")+
  ylab(paste0("Confirmed cases (Today) per 100 people"))+
  default_theme+
  ggtitle("Mobility change vs cases"))

plot_data.permobility_summary<-ddply(plot_data,c("Province.State","variable"),summarise,cor=cor(y =Total_confirmed_cases.per100,x=value),median_change=median(x=value)) %>% arrange(-abs(cor))

kable(plot_data.permobility_summary,caption = "Ranked per-state mobility correlation with total confirmed cases")
Ranked per-state mobility correlation with total confirmed cases
Province.State variable cor median_change
Alaska Transit -1.0000000 -63.0
Delaware Retail_Recreation 1.0000000 -39.5
Delaware Grocery_Pharmacy 1.0000000 -17.5
Delaware Parks -1.0000000 20.5
Delaware Transit 1.0000000 -37.0
Delaware Workplace 1.0000000 -37.0
Delaware Residential -1.0000000 14.0
Hawaii Transit 0.9776497 -89.0
Hawaii Parks 0.9570596 -72.0
New Hampshire Parks 0.9484318 -20.0
Vermont Parks 0.9265774 -35.5
Maine Transit -0.9178901 -50.0
Connecticut Grocery_Pharmacy -0.8810580 -6.0
Utah Transit -0.8563577 -18.0
Utah Residential -0.8407493 12.0
South Dakota Parks 0.7915494 -26.0
Alaska Residential 0.7783577 13.0
Rhode Island Workplace -0.7691260 -39.5
Arizona Residential 0.7651791 13.0
Arizona Grocery_Pharmacy -0.7633117 -15.0
Connecticut Transit -0.7505952 -50.0
Massachusetts Workplace -0.7500022 -39.0
Nevada Transit -0.7087015 -20.0
Wyoming Parks -0.6990219 -4.0
Idaho Residential -0.6873191 11.0
New York Workplace -0.6548160 -34.5
North Dakota Parks 0.6496633 -34.0
Hawaii Grocery_Pharmacy 0.6464499 -34.0
Vermont Grocery_Pharmacy -0.6300055 -25.0
Alaska Workplace -0.6295776 -33.0
Rhode Island Retail_Recreation -0.6280705 -45.0
Alaska Grocery_Pharmacy -0.6143310 -7.0
Rhode Island Residential -0.5981837 18.5
New Jersey Parks -0.5939412 -6.0
Arkansas Parks 0.5927561 -12.0
Utah Parks -0.5923746 17.0
New York Retail_Recreation -0.5911145 -46.0
Maine Workplace -0.5857020 -30.0
Nebraska Workplace 0.5761242 -32.0
Hawaii Retail_Recreation 0.5414797 -56.0
New York Parks 0.5305934 20.0
Utah Workplace -0.5215122 -37.0
Hawaii Workplace -0.5186694 -46.0
Connecticut Retail_Recreation -0.5169942 -45.0
New Jersey Workplace -0.5168293 -44.0
Maine Parks 0.5130656 -31.0
Connecticut Residential 0.5061124 14.0
Massachusetts Retail_Recreation -0.4934303 -44.0
Arizona Retail_Recreation -0.4930843 -42.5
New Mexico Grocery_Pharmacy -0.4853771 -11.0
New Jersey Grocery_Pharmacy -0.4780741 2.5
Arizona Transit 0.4706964 -38.0
Connecticut Workplace -0.4693656 -39.0
Nebraska Residential -0.4658583 14.0
New Mexico Residential 0.4603303 13.5
Maryland Workplace -0.4599100 -35.0
Rhode Island Parks 0.4463111 52.0
Iowa Parks -0.4381705 28.5
Missouri Residential -0.4336525 13.0
South Carolina Parks -0.4258493 -23.0
Kentucky Parks -0.4234761 28.5
North Dakota Retail_Recreation -0.4230245 -42.0
West Virginia Parks 0.4206866 -33.0
Pennsylvania Workplace -0.4183059 -36.0
Vermont Residential 0.4180763 11.5
Illinois Transit -0.4161649 -31.0
Maryland Grocery_Pharmacy -0.4087339 -10.0
New Jersey Retail_Recreation -0.4077368 -62.5
Montana Residential 0.4071094 14.0
Massachusetts Grocery_Pharmacy -0.4041230 -7.0
New Jersey Transit -0.4007408 -50.5
Pennsylvania Retail_Recreation -0.3956811 -45.0
New Mexico Parks 0.3917083 -31.5
Montana Parks -0.3871650 -58.0
New Hampshire Residential -0.3861781 14.0
Oregon Parks -0.3755760 16.5
New Mexico Retail_Recreation -0.3749902 -42.5
Alabama Workplace -0.3681246 -29.0
New York Transit -0.3650745 -48.0
Michigan Parks 0.3561084 28.5
Montana Transit 0.3559413 -41.0
Alabama Grocery_Pharmacy -0.3480114 -2.0
Wisconsin Transit -0.3423397 -23.5
South Carolina Workplace 0.3423161 -30.0
Nebraska Grocery_Pharmacy 0.3409957 -0.5
Florida Residential 0.3350667 14.0
Maryland Retail_Recreation -0.3291124 -39.0
Virginia Transit -0.3289250 -32.5
Nevada Residential 0.3261226 17.0
Maine Retail_Recreation -0.3224848 -42.0
Alaska Retail_Recreation 0.3200184 -39.0
Colorado Residential 0.3176879 14.0
North Dakota Workplace 0.3158752 -40.0
Nevada Workplace -0.3086487 -40.0
Arkansas Retail_Recreation -0.3062877 -30.0
California Parks -0.3027092 -38.5
Vermont Workplace -0.3003789 -43.0
Illinois Workplace -0.2954330 -30.5
California Residential 0.2905539 14.0
Minnesota Transit -0.2825082 -28.5
Pennsylvania Parks 0.2820241 12.0
Nevada Retail_Recreation -0.2794239 -43.0
Vermont Retail_Recreation 0.2784984 -57.0
West Virginia Grocery_Pharmacy -0.2745297 -6.0
California Transit -0.2734953 -42.0
Missouri Workplace 0.2716631 -29.0
Idaho Grocery_Pharmacy -0.2707111 -5.5
Kansas Workplace 0.2686402 -32.5
Maryland Residential 0.2683666 15.0
Pennsylvania Grocery_Pharmacy -0.2682835 -6.0
Idaho Transit -0.2661419 -30.0
Rhode Island Grocery_Pharmacy 0.2621455 -7.5
North Carolina Workplace 0.2579272 -31.0
North Carolina Grocery_Pharmacy 0.2560785 0.0
Illinois Parks 0.2528532 26.5
Idaho Workplace -0.2522669 -29.5
Georgia Grocery_Pharmacy -0.2496677 -10.0
Tennessee Workplace -0.2473044 -31.0
West Virginia Workplace 0.2467768 -33.0
Rhode Island Transit -0.2460045 -56.0
Alabama Transit -0.2456713 -36.5
Tennessee Parks -0.2452253 10.5
Minnesota Parks 0.2424568 -9.0
Wisconsin Parks 0.2384024 51.5
Tennessee Residential 0.2365539 11.5
Alabama Parks 0.2365178 -1.0
Florida Parks -0.2363339 -43.0
Washington Grocery_Pharmacy 0.2352031 -7.0
South Dakota Workplace 0.2349685 -35.0
Missouri Parks 0.2325147 0.0
New York Grocery_Pharmacy -0.2262891 8.0
Indiana Parks -0.2247129 29.0
Oregon Grocery_Pharmacy -0.2239226 -7.0
Oregon Residential -0.2213890 10.5
Wyoming Grocery_Pharmacy -0.2194181 -10.0
Georgia Workplace -0.2169210 -33.5
Georgia Retail_Recreation -0.2157516 -41.0
Nebraska Transit -0.2141835 -9.0
Texas Workplace 0.2133150 -32.0
Idaho Parks 0.2131382 -19.0
Arizona Parks -0.2120306 -44.5
Hawaii Residential 0.2102403 19.0
North Dakota Grocery_Pharmacy -0.2087224 -8.0
Connecticut Parks 0.2019094 43.0
Illinois Residential 0.1980812 14.0
Kansas Grocery_Pharmacy -0.1936450 -14.5
South Dakota Residential 0.1897724 15.0
North Carolina Transit 0.1878768 -32.0
Colorado Parks -0.1869109 2.0
Virginia Residential 0.1858396 14.0
Mississippi Grocery_Pharmacy -0.1837512 -8.0
Texas Residential -0.1836051 15.0
Wisconsin Residential -0.1832962 14.0
Wyoming Workplace -0.1826951 -31.0
Oklahoma Parks 0.1812358 -18.5
New Hampshire Retail_Recreation -0.1751879 -41.0
Pennsylvania Transit -0.1744779 -42.0
Oregon Transit 0.1685490 -27.5
Ohio Transit 0.1684033 -28.0
North Carolina Residential 0.1648259 13.0
Indiana Residential 0.1644697 12.0
Mississippi Residential 0.1643131 13.0
Massachusetts Parks 0.1630229 39.0
Utah Retail_Recreation -0.1604607 -40.0
Kentucky Residential 0.1602447 12.0
Kentucky Grocery_Pharmacy 0.1564833 4.0
Kentucky Transit -0.1540014 -31.0
Montana Retail_Recreation 0.1527703 -51.0
New Mexico Transit 0.1492338 -38.5
North Dakota Residential -0.1482301 17.0
Oregon Retail_Recreation 0.1388651 -40.5
New Jersey Residential 0.1387730 18.0
West Virginia Residential -0.1365126 11.0
Minnesota Retail_Recreation 0.1341718 -40.5
Texas Parks 0.1323216 -42.0
Florida Transit -0.1321455 -49.0
Iowa Transit 0.1305030 -24.0
Mississippi Retail_Recreation -0.1300161 -40.0
Utah Grocery_Pharmacy 0.1276307 -4.0
Virginia Grocery_Pharmacy -0.1275577 -8.0
Wisconsin Grocery_Pharmacy 0.1266344 -1.0
South Carolina Transit -0.1254859 -45.0
North Carolina Retail_Recreation 0.1237610 -34.0
Oklahoma Residential -0.1232078 15.0
Indiana Retail_Recreation 0.1226963 -38.0
Kansas Transit -0.1222306 -23.0
North Dakota Transit 0.1209435 -48.0
Michigan Workplace -0.1187106 -40.0
Texas Grocery_Pharmacy 0.1170093 -14.0
Idaho Retail_Recreation -0.1167041 -40.0
Vermont Transit -0.1163200 -63.0
Mississippi Parks -0.1154960 -25.0
Wisconsin Workplace -0.1106412 -31.0
New Hampshire Grocery_Pharmacy -0.1095949 -6.0
Oklahoma Grocery_Pharmacy -0.1067165 -0.5
Nebraska Retail_Recreation 0.1062651 -36.0
Iowa Workplace 0.1030356 -30.0
Alabama Retail_Recreation 0.1013376 -39.0
Oklahoma Workplace 0.1003382 -31.0
Iowa Grocery_Pharmacy -0.0999713 4.0
Virginia Parks 0.0998608 6.0
Minnesota Grocery_Pharmacy 0.0994224 -6.0
Massachusetts Transit -0.0992764 -45.0
Massachusetts Residential 0.0991267 15.0
Maryland Transit -0.0985531 -39.0
Missouri Transit -0.0981786 -24.5
Indiana Workplace 0.0944700 -34.0
Arkansas Workplace -0.0943607 -26.0
California Grocery_Pharmacy -0.0900293 -11.5
New York Residential 0.0888423 17.5
Alabama Residential -0.0884444 11.0
Nevada Parks 0.0877191 -12.5
Michigan Transit 0.0876931 -46.0
Georgia Transit -0.0839308 -35.0
Virginia Retail_Recreation -0.0796687 -35.0
Arkansas Transit -0.0793940 -27.0
California Retail_Recreation -0.0791402 -44.0
Wyoming Residential 0.0770536 12.5
Arkansas Residential -0.0764416 12.0
Wyoming Transit -0.0763411 -17.0
West Virginia Transit -0.0759173 -45.0
Ohio Residential 0.0731418 14.0
Washington Transit -0.0718212 -33.5
Pennsylvania Residential 0.0710866 15.0
West Virginia Retail_Recreation -0.0703061 -38.5
Ohio Grocery_Pharmacy 0.0701392 0.0
Florida Retail_Recreation 0.0686704 -43.0
Washington Retail_Recreation 0.0685704 -42.0
Michigan Retail_Recreation -0.0681553 -53.0
Montana Workplace -0.0671351 -40.0
Michigan Residential 0.0659481 15.0
North Carolina Parks -0.0643948 7.0
Minnesota Residential -0.0632646 17.0
Kentucky Retail_Recreation 0.0627533 -29.0
South Dakota Transit -0.0601792 -40.0
Virginia Workplace -0.0593562 -32.0
South Dakota Retail_Recreation -0.0586921 -39.0
Iowa Retail_Recreation 0.0571131 -38.0
Kansas Parks 0.0568457 72.0
Texas Retail_Recreation 0.0568120 -40.0
Georgia Parks 0.0545149 -6.0
Tennessee Transit 0.0540167 -32.0
Florida Workplace 0.0510858 -33.0
Indiana Transit 0.0507215 -29.0
New Hampshire Transit 0.0493571 -57.0
Arkansas Grocery_Pharmacy -0.0481090 3.0
Wyoming Retail_Recreation 0.0477724 -39.0
Colorado Retail_Recreation -0.0446004 -44.0
Arizona Workplace -0.0429453 -35.0
New Mexico Workplace 0.0427490 -34.0
Ohio Workplace -0.0414062 -35.0
Texas Transit 0.0412115 -41.5
Illinois Retail_Recreation 0.0405321 -40.0
New Hampshire Workplace 0.0397333 -37.0
Kentucky Workplace -0.0385028 -36.5
Ohio Retail_Recreation 0.0380953 -36.0
Nebraska Parks 0.0379881 55.5
South Carolina Residential -0.0366363 12.0
Tennessee Retail_Recreation -0.0344703 -30.0
Colorado Grocery_Pharmacy -0.0338593 -17.0
Missouri Grocery_Pharmacy -0.0338556 2.0
Mississippi Workplace 0.0335672 -33.0
South Carolina Grocery_Pharmacy -0.0334394 1.0
Georgia Residential -0.0309857 13.0
Maine Residential -0.0299911 11.0
Oklahoma Retail_Recreation -0.0293776 -31.0
Montana Grocery_Pharmacy 0.0239662 -16.0
Illinois Grocery_Pharmacy -0.0236102 2.0
Maine Grocery_Pharmacy -0.0230020 -13.0
Washington Residential -0.0222806 13.0
Oklahoma Transit 0.0201778 -26.0
Minnesota Workplace -0.0201241 -33.0
Washington Workplace 0.0198230 -38.0
Colorado Transit 0.0190398 -36.0
Mississippi Transit 0.0189948 -38.5
Missouri Retail_Recreation -0.0184385 -36.0
Oregon Workplace 0.0180970 -31.0
South Dakota Grocery_Pharmacy -0.0168920 -9.0
Ohio Parks -0.0161482 67.5
Kansas Residential -0.0155494 13.0
California Workplace 0.0147213 -36.0
Iowa Residential 0.0140151 13.0
Maryland Parks 0.0127710 27.0
Michigan Grocery_Pharmacy 0.0068606 -11.0
Tennessee Grocery_Pharmacy 0.0063892 6.0
Nevada Grocery_Pharmacy 0.0033088 -12.5
Wisconsin Retail_Recreation 0.0031699 -44.0
Florida Grocery_Pharmacy 0.0015084 -14.0
South Carolina Retail_Recreation -0.0013071 -35.0
Colorado Workplace -0.0012755 -39.0
Indiana Grocery_Pharmacy 0.0006488 -5.5
Washington Parks 0.0002564 -3.5
Kansas Retail_Recreation -0.0002100 -38.0
Alaska Parks NA 29.0
District of Columbia Retail_Recreation NA -69.0
District of Columbia Grocery_Pharmacy NA -28.0
District of Columbia Parks NA -65.0
District of Columbia Transit NA -69.0
District of Columbia Workplace NA -48.0
District of Columbia Residential NA 17.0
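
A caveat when reading the top of this table: a Pearson correlation computed from only two observations is always exactly ±1, which is the likely explanation for the Delaware and Alaska Transit entries rather than a genuinely perfect relationship. A quick Python check with two arbitrary points:

```python
import numpy as np

# Any two distinct points lie exactly on a line, so |r| is 1 regardless of values.
x = np.array([100.0, 250.0])   # e.g., total cases on two report dates
y = np.array([-37.0, -35.0])   # e.g., workplace mobility change (%)
r = np.corrcoef(x, y)[0, 1]
```

States with very few dates of overlapping case and mobility data should therefore be discounted when ranking by |cor|.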
# sanity check
ggplot(filter(plot_data,Province.State %in% c("Pennsylvania","Maryland","New Jersey","California","Delaware","Connecticut")),aes(x=Total_confirmed_cases.per100,fill=variable))+geom_histogram()+
  facet_grid(~Province.State)+
    default_theme+
  theme(legend.position = "bottom")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

write_plot(mobility.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/mobility.plot.png"
write_plot(mobility.global.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/mobility.global.plot.png"
(plot_data.permobility_summary.plot<-ggplot(plot_data.permobility_summary,aes(x=variable,y=median_change))+
  geom_jitter(size=2,width=.2)+
  #geom_jitter(data=plot_data.permobility_summary %>% arrange(-abs(median_change)) %>% head(n=15),aes(col=Province.State),size=2,width=.2)+
  default_theme+
  ggtitle("Per-State Median Change in Mobility")+
  xlab("Mobility Measure")+
  ylab("Median Change from Baseline"))

write_plot(plot_data.permobility_summary.plot,wd = results_dir)
## [1] "/Users/stevensmith/Projects/coronavirus/results/plot_data.permobility_summary.plot.png"

DELIVERABLE MANIFEST

The following links point to committed documents pushed to github. These are provided as a convenience, but note this is a manual process: the generation of reports, plots, and tables is not coupled to the execution of this markdown.

Report

This report, html & pdf.

Plots

github_root<-"https://github.com/sbs87/coronavirus/blob/master/"

plot_handle<-c("Corona_Cases.world.long.plot",
               "Corona_Cases.world.loglong.plot",
               "Corona_Cases.world.mortality.plot",
               "Corona_Cases.world.casecor.plot",
               "Corona_Cases.city.long.plot",
               "Corona_Cases.city.loglong.plot",
               "Corona_Cases.city.mortality.plot",
               "Corona_Cases.city.casecor.plot",
               "Corona_Cases.city.long.normalized.plot",
               "Corona_Cases.US_state.lm.plot",
               "Corona_Cases.US_state.summary.plot")


deliverable_manifest<-data.frame(
  name=c("World total & death cases, longitudinal",
         "World log total & death cases, longitudinal",
         "World mortality",
         "World total & death cases, correlation",
         "City total & death cases, longitudinal",
         "City log total & death cases, longitudinal",
         "City mortality",
         "City total & death cases, correlation",
         "City population normalized total & death cases, longitudinal",
         "State total cases (select) with linear model, longitudinal",
         "State total cases, longitudinal"),
  plot_handle=plot_handle,
  link=paste0(github_root,"results/",plot_handle,".png")
)


(tmp<-data.frame(row_out=apply(deliverable_manifest,MARGIN = 1,FUN = function(x) paste(x[1],x[2],x[3],sep=" | "))))
##                                                                                                                                                                                                        row_out
## 1                                           World total & death cases, longitudinal | Corona_Cases.world.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.long.plot.png
## 2                                 World log total & death cases, longitudinal | Corona_Cases.world.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.loglong.plot.png
## 3                                                         World mortality | Corona_Cases.world.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.mortality.plot.png
## 4                                      World total & death cases, correlation | Corona_Cases.world.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.casecor.plot.png
## 5                                              City total & death cases, longitudinal | Corona_Cases.city.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.plot.png
## 6                                    City log total & death cases, longitudinal | Corona_Cases.city.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.loglong.plot.png
## 7                                                            City mortality | Corona_Cases.city.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.mortality.plot.png
## 8                                         City total & death cases, correlation | Corona_Cases.city.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.casecor.plot.png
## 9  City population normalized total & death cases, longitudinal | Corona_Cases.city.long.normalized.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.normalized.plot.png
## 10                     State total cases (select) with linear model, longitudinal | Corona_Cases.US_state.lm.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.lm.plot.png
## 11                                      State total cases, longitudinal | Corona_Cases.US_state.summary.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.summary.plot.png
row_out<-apply(tmp, 2, paste, collapse="\t\n")
| name | handle | link |
|------|--------|------|
| World total & death cases, longitudinal | Corona_Cases.world.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.long.plot.png |
| World log total & death cases, longitudinal | Corona_Cases.world.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.loglong.plot.png |
| World mortality | Corona_Cases.world.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.mortality.plot.png |
| World total & death cases, correlation | Corona_Cases.world.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.world.casecor.plot.png |
| City total & death cases, longitudinal | Corona_Cases.city.long.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.plot.png |
| City log total & death cases, longitudinal | Corona_Cases.city.loglong.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.loglong.plot.png |
| City mortality | Corona_Cases.city.mortality.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.mortality.plot.png |
| City total & death cases, correlation | Corona_Cases.city.casecor.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.casecor.plot.png |
| City population normalized total & death cases, longitudinal | Corona_Cases.city.long.normalized.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.city.long.normalized.plot.png |
| State total cases (select) with linear model, longitudinal | Corona_Cases.US_state.lm.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.lm.plot.png |
| State total cases, longitudinal | Corona_Cases.US_state.summary.plot | https://github.com/sbs87/coronavirus/blob/master/results/Corona_Cases.US_state.summary.plot.png |

Tables

CONCLUSION

Overall, the trend of COVID-19 cases is no longer in the log-linear phase for the world or the U.S. (though some regions, such as MD, are still in the log-linear phase). The mortality rate (deaths/confirmed RNA-based cases) is >1%, with the exact value depending on region. Mobility is not a strong indicator of caseload (U.S. data).
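The mortality rate quoted here is a simple ratio of two of the tracked columns. As a minimal base-R sketch (the column names mirror those used in this analysis, but the counts below are illustrative placeholders, not real data):

```r
# Mortality rate = deaths / confirmed (RNA-based) cases, per region.
# Placeholder counts for illustration only.
cases <- data.frame(
  Province.State        = c("New York", "Maryland"),
  Total_confirmed_cases = c(200000, 10000),
  Total_deaths          = c(6000, 300)
)

cases$mortality_rate <- cases$Total_deaths / cases$Total_confirmed_cases
cases  # both placeholder regions come out at 0.03, i.e. 3%
```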

See table below for detailed breakdown.

| Question | Answer |
|----------|--------|
| What is the effect of social distancing and decreased mobility on case load? | There is no strong apparent effect of decreased mobility (work, grocery, retail) or increased mobility (at residence, parks) on the number of confirmed cases, either at the country (U.S.) or state level. California appears to have one of the best correlations, but the results are mixed. |
| What is the trend in cases and mortality across geographical regions? | Confirmed total cases and mortality are overall log-linear for most countries, with a trailing off beginning for most (including the U.S.). At the state level, NY, NJ, and PA are starting to trail off; MD is still in the log-linear phase. Mortality and case load are highly correlated for NY, NJ, PA, and MD. The mortality rate fluctuates for a given region but is about 3% overall. |
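The "log-linear phase" claim can be checked by regressing log cases on time and inspecting the fit. A small sketch using synthetic exact-doubling data (not the source case counts):

```r
# Exponential growth is a straight line on a log scale: regress log10(cumulative
# cases) on day. Synthetic series doubling every 3 days, for illustration only.
day   <- 0:9
cases <- 100 * 2^(day / 3)
fit   <- lm(log10(cases) ~ day)

coef(fit)[["day"]]  # slope = log10(2)/3 ~ 0.100 on the log10 scale
# A region still in the log-linear phase fits this line almost perfectly;
# a trailing-off region shows curvature (the slope shrinks over time).
```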

END

End: ##—— Sat Jun 27 20:29:10 2020 ——##

Cheatsheet: http://rmarkdown.rstudio.com

Sandbox

# Geographical heatmap!
# install.packages("maps")  # run once if the package is missing
library(maps)
library(ggplot2)  # map_data() and plotting
library(dplyr)    # select(), filter(), %>%
mi_counties <- map_data("county", "pennsylvania") %>%
  select(lon = long, lat, group, id = subregion)
head(mi_counties)

ggplot(mi_counties, aes(lon, lat)) + 
  geom_point(size = .25, show.legend = FALSE) +
  coord_quickmap()
mi_counties$cases <- 1:2226  # placeholder values, one per map row
name_overlaps(metadata, Corona_Cases.US_state)

tmp<-merge(Corona_Cases.US_state,metadata)
ggplot(filter(tmp,Province.State=="Pennsylvania"), aes(Long, Lat, group = as.factor(City))) +
  geom_polygon(aes(fill = Total_confirmed_cases), colour = "grey50") + 
  coord_quickmap()


ggplot(Corona_Cases.US_state, aes(Long, Lat))+
  geom_polygon(aes(fill = Total_confirmed_cases ), color = "white")+
  scale_fill_viridis_c(option = "C")
dev.off()


require(maps)
require(viridis)

world_map <- map_data("world")
ggplot(world_map, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill="lightgray", colour = "white")

head(world_map)
head(Corona_Cases.US_state)
unique(select(world_map,c("region","group"))) %>% filter()

some.eu.countries <- c(
  "US"
)
# Retrieve the map data
some.eu.maps <- map_data("world", region = some.eu.countries)

# Compute the centroid as the mean longitude and latitude
# Used as the label coordinate for country names
region.lab.data <- some.eu.maps %>%
  group_by(region) %>%
  summarise(long = mean(long), lat = mean(lat))

unique(filter(some.eu.maps,subregion %in% Corona_Cases.US_state$Province.State) %>% select(subregion))
unique(Corona_Cases.US_state$Total_confirmed_cases.log)
ggplot(filter(Corona_Cases.US_state, Date == "2020-04-17"), aes(x = Long, y = Lat)) +
  geom_polygon(aes( fill = Total_confirmed_cases.log))+
  #geom_text(aes(label = region), data = region.lab.data,  size = 3, hjust = 0.5)+
  #scale_fill_viridis_d()+
  #theme_void()+
  theme(legend.position = "none")
library("sf")
library("rnaturalearth")
library("rnaturalearthdata")

world <- ne_countries(scale = "medium", returnclass = "sf")
class(world)
ggplot(data = world) +
    geom_sf()

counties <- st_as_sf(map("county", plot = FALSE, fill = TRUE))
counties <- subset(counties, grepl("florida", counties$ID))
counties$area <- as.numeric(st_area(counties))
#install.packages("lwgeom")
class(counties)
head(counties)
ggplot(data = world) +
    geom_sf(data=Corona_Cases.US_state) +
    #geom_sf(data = counties, aes(fill = area)) +
  geom_sf(data = counties, aes(fill = area)) +
   # scale_fill_viridis_c(trans = "sqrt", alpha = .4) +
    coord_sf(xlim = c(-88, -78), ylim = c(24.5, 33), expand = FALSE)


head(counties)
tmp<-unique(select(filter(Corona_Cases.US_state,Date=="2020-04-17"),c(Lat,Long,Total_confirmed_cases.per100)))
st_as_sf(map("county", plot = FALSE, fill = TRUE))

dplyr::inner_join(Corona_Cases.US_state, counties)  # there is no `join::` package; sf registers an inner_join method for sf objects

library(sf)
library(sp)

nc <- st_read(system.file("shape/nc.shp", package="sf"))
class(nc)


# coords must be (x, y) = (Long, Lat) for a longlat projection
spdf <- SpatialPointsDataFrame(coords = select(Corona_Cases.US_state,c("Long","Lat")), data = Corona_Cases.US_state,
                               proj4string = CRS("+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0"))

head(spdf)
class(spdf)
st_as_sf(spdf)  # convert the sp object to sf; st_cast() expects an sf/sfc input

filter(Corona_Cases.US_state.summary,Date=="2020-04-20" & Province.State %in% top_states_modified)

https://stevenbsmith.net